A method for determining the degree to which two or more lexical items belonging to a predefined corpus of text
in any given language are semantically related to each other, comprising the following steps:

a) the retrieval from the said text corpus of a set of sentences in which one or more of the given two 
or more lexical items appear, 

b) the parsing, with the aid of a suitable parsing system, of each of the sentences retrieved, in order 
to determine the syntactic dependency structure of each of the said sentences,

c) for each sentence retrieved, determining from the obtained syntactic dependency structure the
contextual relations which the given lexical items have in that sentence, i.e. identifying those items in the
context which have a syntactic relation to those of the given lexical items which appear in the sentence
concerned, together with the syntactic relations involved,

d) determining, for each of the given lexical items, the total number of contextual relations found in
step c),

e) determining the number of contextual relations which the given lexical items have in common,

f) determining, on the basis of the results obtained in steps d) and e), the degree of overlap between 
the contextual patterns of the given two or more lexical items. 
Method for determining the semantic relatedness of lexical items in a text

The invention concerns a method for determining the degree to which two or more lexical items
(morphemes, words, collocations or phrases) belonging to a predefined text corpus in any given language
are semantically related.

Knowledge of the semantic relations between two or more lexical items in a text has applications in
various fields, including computer programs for word processing and programs for automatic translation of
texts in one natural language into texts in another natural language.

Until now it has been customary to base the determination of semantic relatedness on information
previously entered in a dictionary file. Such dictionary files contain identification codes which indicate, for
each word in the dictionary, what semantic features that word has. Alternatively, a system of classification
can be used to classify each word according to its semantic type, or the meaning of each word can be
analysed into semantic components or primitives. Although such methods are widely applied by linguistics
researchers, they are highly labour-intensive and difficult to apply consistently on a large scale owing to
subjective biases, which have a considerable influence on the determination of semantic relations by these
methods.

The present invention has the aim of showing how the semantic relatedness of two or more lexical
items can be determined automatically, without involving the personal judgement of the user.

This aim is achieved, according to the invention, through a method for determining the degree to which
two or more lexical items belonging to a predefined text corpus in any given language are semantically
related, comprising the following steps:

a) the retrieval from the said text corpus of a set of sentences in which one or more of the given two
or more lexical items appear,

b) the parsing, with the aid of a suitable parsing system, of each of the sentences retrieved, in order
to determine the syntactic dependency structure of each of the said sentences,

c) for each sentence retrieved, determining from the obtained syntactic dependency structure the
contextual relations which the given lexical items have in that sentence, i.e. identifying those items in the
context which have a syntactic relation to those of the given lexical items which appear in the sentence
concerned, together with the syntactic relations involved,

d) determining, for each of the given lexical items, the total number of contextual relations found in
step c),

e) determining the number of contextual relations which the given lexical items have in common,

f) determining, on the basis of the results obtained in steps d) and e), the degree of overlap between
the contextual patterns of the given two or more lexical items.

As a result of this method an indication is obtained of the strength of the semantic relation between the
given two or more lexical items. This allows a word processing program, an automatic translation program
or any other such program to make an independent and automatic decision, and to carry out other
processing steps on the basis of that decision.

Although there are a number of methods of statistical analysis which can be applied in order to 
compute the measure of semantic relatedness, the preferred method is to split step f) into two parts: 

f1) determining the number of common contextual relations which can be expected by chance alone,
f2) comparing the number obtained by step f1) with the number obtained by step e).

The comparison in step f2) should preferably be performed by evaluating the following formula:
semantic relatedness = (C-E)/(C + K),

where

C = the number of common contextual relations obtained by step e),

E = the number of common contextual relations which can be expected by chance alone, as obtained by
step f1),

K = a constant.

Although the method according to the invention can in many cases yield good results even with a
limited number of sentences extracted from the text corpus, it will usually be preferable to retrieve from the
text corpus, in step a), all sentences in which one or more of the given lexical items appears. The degree of
semantic relatedness between the given two or more lexical items can be determined with the highest
degree of confidence when all the contextual relations of the said lexical items are taken into account, in
other words when all sentences in which one or more of the given lexical items appears are retrieved from
the text corpus.

The invention will now be described in greater detail with the aid of some examples of its application.
Example 1: Measuring the semantic proximity or semantic distance between the words DISCARD and REMOVE.

As an example of the method according to the invention, in what follows the semantic proximity
between two words is determined on the basis of a number of sentences extracted from an aircraft
maintenance manual. In this example, only a few sentences are used for each of the two key words, but it
will be obvious that as many sentences as possible should be used in order to obtain reliable results, and
that preferably the method should be based on all those sentences in the whole text corpus (in this case the
whole maintenance manual) which contain one or both of the key words. In the present example the aim is
to determine the semantic proximity between the words DISCARD and REMOVE. The following five
sentences were retrieved from the corpus, all containing the word DISCARD:

[1] Remove and DISCARD the O-rings (... and ...).

[2] Remove and DISCARD the split pins (18) and remove the nuts (17) and washers (16) from the
clamp rods (11).

[3] DISCARD the gasket (9).

[4] Remove and DISCARD the two split pins which safety the autopilot cable end fittings (21).

[5] DISCARD the lockwire from the glandnuts (2).

With the identification and retrieval of these sentences, step a) of the method according to the invention
has been partially completed. (The remaining part of step a) consists in the retrieval of a set of sentences
containing the word REMOVE, and this part will be discussed below.) Next, as defined in step b) of the
method, each of the sentences retrieved must be parsed with the aid of a suitable parsing system in order
to determine the syntactic dependency structure of each sentence. Such syntactic analysers or parsers
require no further explanation for a specialist in this field. For example, the last sentence of the above set
might be converted by one of the known types of parser to a syntactic dependency tree with the following
result:



[GOVERNOR, 'discard',
    [DIRECT OBJECT, 'lockwire',
        [DETERMINER, 'the']],
    [PREPOSITIONAL ADJUNCT, 'from', 'glandnuts',
        [DETERMINER, 'the'],
        [EPITHET, '(2)']]]



(The linguistic terms used in the above representation are assumed to be familiar to a specialist in this field
and to need no further elucidation.)

The key word (or words, if both key words happen to occur in the same sentence) can now be
extracted from this dependency structure, together with those elements of the context which have a direct
relation to the key word (or words). For example, from the above dependency structure for sentence No. 5 it
is possible to determine that the key word DISCARD has a direct relation to the word "lockwire", which is
labelled "DIRECT OBJECT". Such contextual relations can be extracted from the obtained dependency
structure for each sentence in turn.
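
The extraction of direct relations from a dependency structure can be sketched as follows. This is only an illustrative sketch, not the parser or data format of the invention: the `Node` type, the function name `direct_relations` and the label "COMPLEMENT" for the noun governed by the preposition are assumptions introduced here for the example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """One node of a dependency tree (hypothetical representation)."""
    label: str           # syntactic relation to the governing node
    word: str            # the lexical item at this node
    children: tuple = ()


# Dependency tree for sentence No. 5, "DISCARD the lockwire from the glandnuts (2)."
# The label "COMPLEMENT" is an assumption; it is not given in the text.
tree = Node("GOVERNOR", "discard", (
    Node("DIRECT OBJECT", "lockwire", (Node("DETERMINER", "the"),)),
    Node("PREPOSITIONAL ADJUNCT", "from", (
        Node("COMPLEMENT", "glandnuts", (
            Node("DETERMINER", "the"),
            Node("EPITHET", "(2)"),
        )),
    )),
))


def direct_relations(node, key_word):
    """Return (key_word, relation, dependent) for every direct dependent of key_word."""
    found = []
    if node.word == key_word:
        found.extend((key_word, child.label, child.word) for child in node.children)
    for child in node.children:
        found.extend(direct_relations(child, key_word))
    return found


relations = direct_relations(tree, "discard")
for relation in relations:
    print(relation)
```

In this sketch only the immediate dependents of the key word are reported; following a function word such as "from" through to the noun it governs would need one further step.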

In addition the dependency structures obtained are also searched for any indirect relation either of the
key words may have to another word in its context via a function word such as a preposition or conjunction.
In the dependency structure which would be obtained for sentence No. 1, for example, the key word
DISCARD would be found to have an indirect relation to the other key word REMOVE via the conjunction
"AND".

The result obtained by tabulating all the relations which can be found for the above-mentioned key
words in the syntactic dependency structures corresponding to the above sentences is as follows:
Sentence   Relation no.   First word   Relation   Second word

1          1              remove       AND        discard
1          2              discard      OBJECT     O-ring
2          1              remove       AND        discard
2          2              discard      OBJECT     pin
3          1              discard      OBJECT     gasket
4          1              remove       AND        discard
4          2              discard      OBJECT     pin
5          1              discard      OBJECT     lockwire

The number in the first column of each row in the above table shows the number of the sentence,
corresponding to the numbers used in the above list of sentences, and the number in the second column
shows the serial number of the relation found in the given sentence, in which one or both of the key words
appear. It can be seen that in a few cases a relation exists between the two key words themselves.

A wholly identical procedure can now be followed for the second key word REMOVE. The following set
of five sentences can be extracted from the manual for this purpose:

[1] Lift the loosened bus-bars (7) from the terminal studs (6) and REMOVE the contactor (14) from
the interface (12).

[2] When power to main ac bus 1 (2) is REMOVEd, the following events occur.

[3] Do not REMOVE the nuts (5).

[4] REMOVE the lockwire and REMOVE the sensor connector (9) from the receptacle (10).

[5] REMOVE and discard the split pins (18) and REMOVE the nuts (17) and washers (16) from the
clamp rods (11).

After each of these sentences has been subjected to structural analysis and the respective syntactic 
dependency structures have been obtained, the following relations can be extracted:
Sentence   Relation no.   First word   Relation   Second word

1          1              lift         AND        remove
1          2              remove       OBJECT     contactor
1          3              remove       FROM       interface
2          1              remove       OBJECT     power
3          1              remove       OBJECT     nut
4          1              remove       OBJECT     lockwire*
4          2              remove       AND        remove*
4          3              remove       OBJECT     connector
4          4              remove       FROM       receptacle
5          1              remove       AND        discard
5          2              remove       OBJECT     pin*
5          3              remove       AND        remove*
5          4              remove       OBJECT     nut
5          5              remove       OBJECT     washer
5          6              remove       FROM       rod

Here too, relations are found between the key word itself (REMOVE) and various other words, but also
between REMOVE and the other key word DISCARD.

It also appears from the two tables above that both key words have common relations to identical words
in their context, as shown in the second table by an asterisk. Thus, for instance, the word "pin" appears in
the OBJECT relation both to DISCARD and to REMOVE.
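
The search for common relations can be illustrated by a set intersection. The sketch below is not part of the method as claimed; it assumes, for simplicity, that each contextual relation is reduced to a pair (relation, other word), that the direction of the relation is ignored, and that repeated occurrences of the same pair are counted once. The variable names are hypothetical.

```python
# Contextual relations of DISCARD, taken from the first table above.
discard_relations = {
    ("AND", "remove"),
    ("OBJECT", "O-ring"),
    ("OBJECT", "pin"),
    ("OBJECT", "gasket"),
    ("OBJECT", "lockwire"),
}

# Contextual relations of REMOVE, taken from the second table above.
remove_relations = {
    ("AND", "lift"),
    ("OBJECT", "contactor"),
    ("FROM", "interface"),
    ("OBJECT", "power"),
    ("OBJECT", "nut"),
    ("OBJECT", "lockwire"),
    ("AND", "remove"),
    ("OBJECT", "connector"),
    ("FROM", "receptacle"),
    ("AND", "discard"),
    ("OBJECT", "pin"),
    ("OBJECT", "washer"),
    ("FROM", "rod"),
}

# Relations which both key words have to identical words in their context.
common = discard_relations & remove_relations

print(sorted(common))
print(len(discard_relations), len(remove_relations), len(common))
```

Under these assumptions the two key words share the OBJECT relation to "pin" and to "lockwire" and the AND relation to "remove".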

A comparison of the above two tables of relations in the context makes it possible to find meaningful
similarities in the contextual patterns of semantically related words such as, in the present example,
DISCARD and REMOVE.

Even with the limited number of sentences used in this example, a few common contextual
elements already appear. If the whole text is processed, and all the sentences are extracted in which at
least one of the key words occurs, then the total number of common contextual elements will certainly
increase. The more contextual relations the two key words have in common, the smaller will be the
semantic distance between them, or in other words, the stronger is the similarity or identity between the
meanings or fields of reference of the two words. In accordance with the method as defined by the
invention, statistical methods can now be applied to the above-mentioned lists of relations in order to arrive
at a numerical measure of this semantic proximity.

This measure of semantic proximity should be a function of

(a) the number of contextual relations the words being compared have in common, and
(b) the number of contextual relations which can be found, for each of the key words, in the selected
set of sentences. (Ideally, the selected set of sentences should be equal to the total text corpus.)

Thus in the above example the semantic proximity of the words DISCARD and REMOVE depends not
only on the number of common relations, such as the OBJECT relation in which the word "pin" appears to
both words, but also on the total number of contextual relations the words DISCARD and REMOVE have in
the text corpus which serves as the source of lexical knowledge.

There are a large number of possible statistical methods of expressing the degree of semantic
proximity between two words. The preferred method, however, is to compute the semantic relatedness
mentioned in step f) by subtracting from the number of relations obtained in step e) the number which can
be expected by chance alone, and then dividing the result by the number obtained in step e), increased by
a constant. In other words, the formula applied is

semantic proximity = (C-E)/(C + K),

where

C = the number of common contextual relations,

E = the number of such relations which can be expected by chance alone,

K = a constant.

The number of relations to be expected on the basis of chance alone is in theory given by

E = A * B/f(N),

where

A = the number of relations found for the first word,

B = the number of relations found for the second word,

f(N) = a function of the number of different relations, N, in the corpus of text.

Suppose that for the word DISCARD in the present example a total of 300 contextual relations are found
in the text, that for the word REMOVE a total of 500 relations are found, and that 50 of these relations are
common to both words. Suppose further that for the function f(N) of the number of different relations, N, in
the corpus of text a value of 15000 has been established experimentally, and that for the constant K a value
of 1 is chosen. The number of common relations to be expected on the basis of chance alone is
determined by the above formula as:

E = A * B/f(N) = 300 * 500/15000 = 10.

In accordance with the first of the above formulae, a numerical value can now be obtained for the
measure of semantic relatedness, or semantic proximity in this case, of the two words DISCARD and
REMOVE:

proximity = (C-E)/(C + K) = (50-10)/(50 + 1) = 0.784.

The larger the number of common relations, and the smaller the expected number of relations, the
closer the obtained value will approach unity.
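
The arithmetic of this example can be checked with a few lines of code. This is a minimal sketch under the figures stated above (A = 300, B = 500, C = 50, f(N) = 15000, K = 1); the function names are hypothetical and f(N) is simply passed in as a number, since the text does not give its form.

```python
def expected_by_chance(a, b, f_n):
    """E = A * B / f(N): common relations expected by chance alone."""
    return a * b / f_n


def proximity(c, e, k=1.0):
    """Semantic proximity = (C - E) / (C + K)."""
    return (c - e) / (c + k)


e = expected_by_chance(a=300, b=500, f_n=15000)
p = proximity(c=50, e=e, k=1)

print(e)            # 10.0
print(round(p, 3))  # 0.784
```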

In practice, computing the value of f(N) will not be trivial because the distribution of the different
contextual relations is not even, and because it is subject to various kinds of constraint, depending on the
part of speech, for example. The value of f(N) can also be set experimentally by choosing the
value which yields the most acceptable results.

The value of K also depends on the application of the method. This constant has a normalizing effect,
first and foremost. Adding a constant to the denominator of the above expression causes the semantic
relatedness to be expressed by a number between zero and unity. On the other hand, this constant also
has the effect of reducing the measure of semantic relatedness when this is based on a very low value of C
(i.e. a value which indicates that the number of common relations is small). This effect can be useful for
limiting the influence of chance coincidences. If the numbers are relatively small, then in general the
conclusions which can be drawn from them will be less reliable.

It may happen that hardly any common contextual relations are found for the given lexical items, although
a certain number of common relations would be expected on the grounds of chance alone. In that case the
measure of semantic relatedness acquires a negative value. It is preferable in such cases to replace the
term C in the denominator of the above expression with the term E, so that the values obtained will be
normalized between zero and minus one. The formula then becomes:

relatedness = (C-E)/(E + K).
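
The two cases can be combined in a single function, as in the following sketch. The function name and the sample figures (no common relations where 10 would be expected by chance) are assumptions chosen only to illustrate the negative case.

```python
def signed_relatedness(c, e, k=1.0):
    """(C - E)/(C + K) when C >= E, otherwise (C - E)/(E + K).

    With K > 0 the result stays between -1 and +1.
    """
    denominator = (c if c >= e else e) + k
    return (c - e) / denominator


positive_case = signed_relatedness(c=50, e=10, k=1)  # more common relations than chance
negative_case = signed_relatedness(c=0, e=10, k=1)   # fewer common relations than chance

print(round(positive_case, 3))
print(round(negative_case, 3))
```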

Another possible way of expressing the degree of semantic relatedness between two words is to divide
the number of common relations, C, by the sum of the total number of relations, A, found for the first word
and the total number of relations, B, found for the second word. The result is a numerical value which
expresses the semantic relatedness of the two words. In other words:

relatedness = C/(A + B),

where

A = the total number of relations for the first word,

B = the total number of relations for the second word,

C = the number of common relations.

This formula yields a value which, depending on the numbers involved, will lie between 0 and 1/2 for
two key words, or between 0 and 1/3 for three key words. Since there is a theoretical upper limit for
semantic relatedness (namely complete synonymy), it is convenient to again normalize the measure of
relatedness between zero and unity, as in the preferred method discussed above. This can be done by
multiplying the numerator in the above expression by the number of key words involved in the comparison.
Thus, in general:

relatedness = (number of key words) * C/(A + B).

Suppose once more that for the word DISCARD in the present example a total of 300 contextual
relations are found in the text, that for the word REMOVE a total of 500 relations are found, and that 50 of
these relations are common to both words. The numerical measure of semantic relatedness, or semantic
proximity in this case, for the two words DISCARD and REMOVE is given by 2 * 50/(300 + 500) = 0.125.
The larger the number of common relations, the closer the measure of relatedness obtained approaches
unity.
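
This alternative measure is easily expressed in code. The sketch below assumes that the totals for the individual key words are supplied as a list, so that the same hypothetical function `overlap_relatedness` covers two, three or more key words.

```python
def overlap_relatedness(common, totals):
    """(number of key words) * C / (sum of the totals found for each key word)."""
    return len(totals) * common / sum(totals)


# DISCARD: 300 relations, REMOVE: 500 relations, 50 relations in common.
value = overlap_relatedness(common=50, totals=[300, 500])
print(value)  # 0.125
```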

Such a measure of semantic distance or proximity can be applied in practice in the production of
machine translations, for example. By way of illustration, the English word "smooth" and its various French
translations will be considered. The word "smooth" has a number of possible equivalents in French, with
clearly different meanings: "lisse", "uni", "poli", "doux", "insinuant".

In such cases as this, where a single word can be translated into another language in several different
ways, with different meanings, it is common practice in conventional dictionaries to augment the entry in
question with a number of codified contextual references, and to place these in a bilingual word list together
with the relevant meanings or translations, e.g.:

smooth (leather) = lisse

smooth (road) = uni

smooth (glass) = poli

smooth (skin) = doux

smooth (talk) = insinuant

The problem then is to deduce from the text being translated which of the meanings is appropriate in
the current context and thus how the word in question is to be translated. For example, if the word
"smooth" appears in the combination "smooth path", the system needs to be able to decide which of the
translations given in the dictionary is most appropriate, i.e. which translation of "smooth" fits best in the
context of "path". In this example, the most appropriate French word will presumably be "uni". Now if a text
corpus is searched using the method defined by the invention, a semantic proximity index can be worked
out for each of the contextual examples in the dictionary, and this will show that, in view of the number of
common relations found, there is a high degree of semantic proximity between the words "path" and
"road", whereas the measure of proximity to the other dictionary examples will be much lower. On these
grounds the system can decide that the French word "uni" is the correct translation of "smooth".

This example shows why the number of common relations must be considered in relation to the total
number of relations found for each word. If words A and B have 50 relations in common, for instance,
whereas words A and C have only 10 relations in common, then the conclusion can be drawn that A is
closer in meaning to B than to C, always provided that the total number of relations found in the text is the
same for B as for C. If, on the other hand, the totals are different, this factor must be taken into account.
The finding of 10 common relations between A and C may be statistically more significant than the 50
common relations between A and B, if B is a high-frequency word and C is a relatively rare
word, e.g. "gasket".
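
The effect of word frequency can be made concrete with the preferred formula. In the sketch below only the numbers of common relations (50 and 10) come from the text; the totals of 400, 3000 and 40 relations for words A, B and C, the value f(N) = 15000 and K = 1 are hypothetical figures chosen for illustration, and the function names are likewise assumptions.

```python
def expected_by_chance(a, b, f_n):
    """E = A * B / f(N)."""
    return a * b / f_n


def relatedness(c, e, k=1.0):
    """(C - E)/(C + K), or (C - E)/(E + K) when fewer relations are found than expected."""
    denominator = (c if c >= e else e) + k
    return (c - e) / denominator


F_N = 15000      # hypothetical value of f(N)
TOTAL_A = 400    # hypothetical total for word A
TOTAL_B = 3000   # hypothetical total for the high-frequency word B
TOTAL_C = 40     # hypothetical total for the rare word C

score_ab = relatedness(50, expected_by_chance(TOTAL_A, TOTAL_B, F_N))
score_ac = relatedness(10, expected_by_chance(TOTAL_A, TOTAL_C, F_N))

print(round(score_ab, 3), round(score_ac, 3))
```

With these assumed totals, 80 common relations between A and B would be expected by chance alone, so that the 50 actually found yield a negative score, whereas the 10 relations shared with the rare word C yield a clearly positive one.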
Example 2: Measuring the degree of semantic association between two words such as PRESSURE and
VALVE.

Before this example is discussed in detail it must be pointed out that there is a difference between
semantic association and semantic proximity as types of semantic relatedness. The words
PRESSURE and VALVE are certainly not similar in meaning, one word (pressure) referring to an abstract
concept and the other (valve) referring to a concrete piece of equipment. The semantic distance between
them should therefore be relatively large, i.e. the numerical measure of semantic proximity should be low.
However, the method described above can also be successfully applied to determine the degree of
semantic association instead of semantic distance or proximity, as will be illustrated below.

Just as in example 1, the two key words PRESSURE and VALVE are used to retrieve from a corpus of
text that set of sentences in which at least one of the key words occurs. This time, however, only those
sentences are retained in which both key words appear. Ten such sentences extracted from a sample text
are shown below:

[1] A temperature-compensated PRESSURE switch, a fill VALVE and a safety device are installed on
the bottle.

[2] The spool VALVE supplies PRESSURE to the hydraulic motor.

[3] If the isolation VALVE cuts off the PRESSURE to the system, application of the brake is automatic.

[4] The PRESSURE goes through the second-stage poppet of the shutoff VALVE to the high
PRESSURE ports of the spool VALVE.

[5] A PRESSURE relief VALVE prevents overpressure in the hydraulic system.

[6] A bleed-air regulating and relief VALVE controls the air PRESSURE in the system reservoir.

[7] The offloader VALVE decreases the PRESSURE to 2750 - 3430 kPa (400-500 psi) if the
hydraulic systems are not used.

[8] Two vacuum relief VALVEs prevent a negative PRESSURE.

[9] The selector VALVE supplies oil PRESSURE to move the piston in the control cylinder.

[10] The system accumulator nitrogen lines connect the gas chamber of the system accumulator to
charging VALVE and its PRESSURE gage.
Again, each of these sentences must be analysed with the aid of a parsing system, in order to establish
the syntactic structure of each sentence. Once the syntactic structure is available, each of the structures
can be examined in order to determine whether
1) the two key words are directly connected to each other in the syntactic structure, or
2) the two key words are linked to each other by some intervening node.

The following table shows the kind of information which can be extracted from such structures after
each of the sentences has been parsed and the corresponding parse structure has been established.

1    switch AND valve          +  switch ATTRIBUTE pressure
2    supply SUBJECT valve      +  supply OBJECT pressure
3    cut SUBJECT valve         +  cut OBJECT pressure
4    port OF valve             +  port ATTRIBUTE pressure
5    valve ATTRIBUTE relief    +  relief ATTRIBUTE pressure
6    control SUBJECT valve     +  control OBJECT pressure
7    decrease SUBJECT valve    +  decrease OBJECT pressure
8    prevent SUBJECT valve     +  prevent OBJECT pressure
9    supply SUBJECT valve      +  supply OBJECT pressure
10   valve AND gage            +  gage ATTRIBUTE pressure

As the table shows, the words PRESSURE and VALVE, although dissimilar in meaning, are nevertheless
linked to each other by their relations to other words such as "switch", "supply", "cut", "port", "relief",
"control", "decrease", "prevent" and "gage". Identifying these syntactic connections in the context makes it
possible not only to estimate the degree or strength of association between any given words, but also to
identify the kind of association involved. It is immediately clear from the above table that the dominating
type of association is that in which VALVE is the subject, and PRESSURE the direct object, of some
common verb. The actual verbs encountered in this relation in the above table are "supply", "cut",
"control", "decrease" and "prevent", which gives a clear characterization of the function of a valve
with regard to pressure.
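
The dominating type of association can be found by a simple count, as sketched below. Each row of the above table is reduced here to a triple (linking word, relation to VALVE, relation to PRESSURE); this reduction, the reading of the relation in row 1 as AND, and the variable names are assumptions made for the purpose of the example.

```python
from collections import Counter

# (linking word, relation to VALVE, relation to PRESSURE), one triple per table row.
rows = [
    ("switch", "AND", "ATTRIBUTE"),
    ("supply", "SUBJECT", "OBJECT"),
    ("cut", "SUBJECT", "OBJECT"),
    ("port", "OF", "ATTRIBUTE"),
    ("relief", "ATTRIBUTE", "ATTRIBUTE"),
    ("control", "SUBJECT", "OBJECT"),
    ("decrease", "SUBJECT", "OBJECT"),
    ("prevent", "SUBJECT", "OBJECT"),
    ("supply", "SUBJECT", "OBJECT"),
    ("gage", "AND", "ATTRIBUTE"),
]

# Count how often each pair of relations links the two key words.
patterns = Counter((to_valve, to_pressure) for _, to_valve, to_pressure in rows)
dominant, count = patterns.most_common(1)[0]

# The linking words which occur in the dominating pattern.
verbs = sorted({word for word, to_valve, to_pressure in rows
                if (to_valve, to_pressure) == dominant})

print(dominant, count)
print(verbs)
```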

This possibility of characterizing the association proves particularly valuable for
making a choice in cases of ambiguity in collocations with an implicit relation, such as noun strings in
English. In the above example it so happened that in the sentences retrieved, only indirect relations were
found between the two key words, but a direct relation might well have been found in the corpus, as in the
collocation "pressure valve". This would incidentally have strengthened the index of association between
the two words. The explicit characterization of that association is obtained from the indirect connections
shown above. Just as in example 1, the degree or strength of the association between two words can be
numerically expressed as a function of the number of connecting relations found between the two words
and as a function of the total number of relations for the words themselves.

The degree of semantic association, when expressed in a suitable form, also has a role to play in
machine translation programs. This can be illustrated with the following example sentences:

[1] Remove the pins from the bandages.
[2] Remove the pins from the bolts.

If in the language into which these English sentences are to be translated (e.g. Dutch) it is necessary to
clearly differentiate between different translations of the word "pin" (e.g. the Dutch word "speld", meaning a
'sharp-pointed fastener' in the first sentence, and Dutch "splitpen", meaning 'a kind of peg' in the second
sentence), then in the course of translation a point will be reached at which a choice has to be made. The
relation between the word "pin" and the word "remove" does not help in this case, because both kinds of
pin can equally well be removed. The solution of the problem of word choice thus depends on establishing
a link between one of the alternative translations of "pin" and the translation of "bandage", and between
one of the alternative translations of "pin" and the translation of "bolt". In other words, the choice depends
on the degree of association between the above-mentioned words as determined on the basis of the
contextual patterns they exhibit in the target language (the language into which the text is being translated).

If the degree of this association is determined using the method according to the invention, it will
appear that the Dutch word for "bandages" has a stronger association with the Dutch word "speld" than it
does with the word "splitpen". On the other hand, the Dutch word for "bolts" will show a stronger
association with the word "splitpen" than it does with the word "speld". Thus, on the basis of the strength
of the observed association, a correct choice can be made for the translation of the ambiguous word "pin".
The stronger the association between the relevant words, the greater the confidence with which this choice
can be made.



Claims

1. A method for determining the degree to which two or more lexical items belonging to a predefined
corpus of text in any given language are semantically related to each other, comprising the following steps:

a) the retrieval from the said text corpus of a set of sentences in which one or more of the given two
or more lexical items appear,

b) the parsing, with the aid of a suitable parsing system, of each of the sentences retrieved, in order
to determine the syntactic dependency structure of each of the said sentences,

c) for each sentence retrieved, determining from the obtained syntactic dependency structure the
contextual relations which the given lexical items have in that sentence, i.e. identifying those items in the
context which have a syntactic relation to those of the given lexical items which appear in the sentence
concerned, together with the syntactic relations involved,

d) determining, for each of the given lexical items, the total number of contextual relations found in
step c),

e) determining the number of contextual relations which the given lexical items have in common,

f) determining, on the basis of the results obtained in steps d) and e), the degree of overlap between
the contextual patterns of the given two or more lexical items.

2. A method according to claim 1, characterized in that step f) is subdivided into two parts:

f1) determining the number of common contextual relations which can be expected by chance alone,
f2) comparing the number obtained by step f1) with the number obtained by step e).

3. A method according to claim 2, characterized in that the comparison in step f2) is performed by
evaluating the following formula: semantic relatedness = (C-E)/(C + K), where

C = the number of common contextual relations obtained by step e),

E = the number of relations to be expected by chance alone, as obtained by step f1),

K = a constant.

4. A method according to claim 2, characterized in that, where the number of common contextual
relations to be expected by chance alone, as obtained by step f1), is larger than the number of common
relations obtained by step e), the comparison in step f2) is performed by evaluating the following formula:
relatedness = (C-E)/(E + K).

5. A method according to claim 2, 3 or 4, characterized in that the result of step f1) is determined by
evaluating the following formula: E = A * B/f(N), where

A = the number of relations obtained in step d) for the first lexical item,

B = the number of relations obtained in step d) for the second lexical item,

f(N) = a function of the number of different relations, N, in the total above-mentioned predefined corpus of
text.

6. A method according to claim 1, characterized in that the degree of contextual overlap mentioned in
step f) is obtained by determining the sum of the numbers of common relations obtained by step d) for the
individual lexical items, and then dividing the result by the number obtained by step e).

7. A method according to claim 6, characterized in that the said sum is multiplied by the number of
lexical items for which the degree of relatedness is being determined.